Session 4: Linking and joining data & SQL

ESS: Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2024-07-25

Introduction

This Course

Day Session
1 Introduction
2 Data Structures and Wrangling
3 Working with Files
4 Linking and joining data & SQL
5 Scaling, Reporting and Database Software
6 Introduction to the Web
7 Static Web Pages
8 Application Programming Interface (APIs)
9 Interactive Web Pages
10 Building a Reproducible Research Project

The Plan for Today

In this session, you learn:

  • why and how to work with relational data
  • how to join data from different tables in R
  • how to join data from different tables in SQL

Via DALL-E

Relational data

Why do we need to join data?

  • It’s rare that a data analysis involves only a single data frame
  • As we learned in session 2: good tables don’t contain redundancies (information about different entities should be stored separately)
  • Most databases use a relational database management system (or RDBMS)
  • In practice, we always think of a relation as a connection between a pair of tables
  • Relations among three or more tables are always a property of the relations between each pair

Basic concept: keys

  • primary key: a variable or set of variables (which is then called a compound key) that uniquely identifies each observation
  • foreign key: a variable (or set of variables) that corresponds to a primary key in another table
df1 <- tibble(
  id = 1:6,
  capital_letters = LETTERS[1:6]
)
df1
# A tibble: 6 × 2
     id capital_letters
  <int> <chr>          
1     1 A              
2     2 B              
3     3 C              
4     4 D              
5     5 E              
6     6 F              
df2 <- tibble(
  id = c(1:5, 7),
  lowercase_letters = letters[c(1:5, 7)]
)
df2
# A tibble: 6 × 2
     id lowercase_letters
  <dbl> <chr>            
1     1 a                
2     2 b                
3     3 c                
4     4 d                
5     5 e                
6     7 g                

Joining tables in R

Source: Wickham, Çetinkaya-Rundel, and Grolemund (2023)

Two different joins

  • Mutating joins: add new variables to one data frame from matching observations in another.
  • Filtering joins: filter observations from one data frame based on whether or not they match an observation in another.

left_join

  • Mutating join
  • keeps all variables from x and y
  • keeps all rows from x
  • drops rows from y where key has no match in x

right_join

  • Mutating join
  • keeps all variables from x and y
  • keeps all rows from y
  • drops rows from x where key has no match in y
  • if you reverse x and y, it’s the same as left_join

full_join

  • Mutating join
  • keeps all variables from x and y
  • keeps all rows from x and y

inner_join

  • Mutating join
  • keeps all variables from x and y
  • drops rows from x where key has no match in y
  • drops rows from y where key has no match in x
  • essentially keeps only rows whose key appears in both tables

semi_join

  • Filtering join
  • only keeps variables from x
  • drops rows from x where key has no match in y

anti_join

  • Filtering join
  • only keeps variables from x
  • drops rows from x where key has a match in y

Syntax: left and right

left_join(df1, df2, by = "id")
# A tibble: 6 × 3
     id capital_letters lowercase_letters
  <dbl> <chr>           <chr>            
1     1 A               a                
2     2 B               b                
3     3 C               c                
4     4 D               d                
5     5 E               e                
6     6 F               <NA>             
right_join(df1, df2, by = "id")
# A tibble: 6 × 3
     id capital_letters lowercase_letters
  <dbl> <chr>           <chr>            
1     1 A               a                
2     2 B               b                
3     3 C               c                
4     4 D               d                
5     5 E               e                
6     7 <NA>            g                

Syntax: full and inner join

full_join(df1, df2, by = "id")
# A tibble: 7 × 3
     id capital_letters lowercase_letters
  <dbl> <chr>           <chr>            
1     1 A               a                
2     2 B               b                
3     3 C               c                
4     4 D               d                
5     5 E               e                
6     6 F               <NA>             
7     7 <NA>            g                
inner_join(df1, df2, by = "id")
# A tibble: 5 × 3
     id capital_letters lowercase_letters
  <dbl> <chr>           <chr>            
1     1 A               a                
2     2 B               b                
3     3 C               c                
4     4 D               d                
5     5 E               e                

Syntax: semi and anti join

semi_join(df1, df2, by = "id")
# A tibble: 5 × 2
     id capital_letters
  <int> <chr>          
1     1 A              
2     2 B              
3     3 C              
4     4 D              
5     5 E              

Same as:

df1 |> 
  filter(id %in% df2$id)
# A tibble: 5 × 2
     id capital_letters
  <int> <chr>          
1     1 A              
2     2 B              
3     3 C              
4     4 D              
5     5 E              
anti_join(df1, df2, by = "id")
# A tibble: 1 × 2
     id capital_letters
  <int> <chr>          
1     6 F              
df1 |> 
  filter(!id %in% df2$id)
# A tibble: 1 × 2
     id capital_letters
  <int> <chr>          
1     6 F              

Finding keys

pets <- tribble(
  ~animal, ~name,            ~breed,
  #-----       
  "cat",   "Leo",            "European Shorthair",
  "dog",   "Bruno",          "Labrador",
  "cat",   "Charlie",        "Ragdolls",
  "mouse", "Julius Cheeser", "Agouti",
  "cat",   "Felix",          "Persian",
  "horse", "Tina",           "Mustang",
  "rabbit","snowball",       "Persian",
)
breeds <- tribble(
  ~species,   ~breed,                ~life_expectancy,
  #-----
  "cat",     "European Shorthair",  14,
  "dog",     "Labrador",            11,
  "cat",     "Ragdolls",            15,
  "cat",     "Persian",             15,
  "mouse",   "Agouti",              1,
  "snake",   "Python",              25,
  "rabbit",  "Persian",             12,
)

Two criteria:

  • primary key should uniquely identify cases
  • primary key should have a corresponding foreign key (at least for some cases)
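A quick way to check the first criterion is to count how often each candidate key combination occurs; a valid primary key has no duplicated combinations. A minimal sketch using the breeds table from above:

```r
library(dplyr)
library(tibble)

breeds <- tribble(
  ~species,  ~breed,               ~life_expectancy,
  "cat",     "European Shorthair", 14,
  "dog",     "Labrador",           11,
  "cat",     "Ragdolls",           15,
  "cat",     "Persian",            15,
  "mouse",   "Agouti",             1,
  "snake",   "Python",             25,
  "rabbit",  "Persian",            12
)

# a candidate primary key should yield zero duplicated combinations
dups <- breeds |>
  count(species, breed) |>
  filter(n > 1)
nrow(dups) # 0 means (species, breed) uniquely identifies each row
```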

Finding keys

left_join(pets, breeds, by = c("animal" = "species"), suffix = c("_x", "_y"))
# A tibble: 13 × 5
   animal name           breed_x            breed_y            life_expectancy
   <chr>  <chr>          <chr>              <chr>                        <dbl>
 1 cat    Leo            European Shorthair European Shorthair              14
 2 cat    Leo            European Shorthair Ragdolls                        15
 3 cat    Leo            European Shorthair Persian                         15
 4 dog    Bruno          Labrador           Labrador                        11
 5 cat    Charlie        Ragdolls           European Shorthair              14
 6 cat    Charlie        Ragdolls           Ragdolls                        15
 7 cat    Charlie        Ragdolls           Persian                         15
 8 mouse  Julius Cheeser Agouti             Agouti                           1
 9 cat    Felix          Persian            European Shorthair              14
10 cat    Felix          Persian            Ragdolls                        15
11 cat    Felix          Persian            Persian                         15
12 horse  Tina           Mustang            <NA>                            NA
13 rabbit snowball       Persian            Persian                         12
left_join(pets, breeds, by = c("animal" = "species", "breed"))
# A tibble: 7 × 4
  animal name           breed              life_expectancy
  <chr>  <chr>          <chr>                        <dbl>
1 cat    Leo            European Shorthair              14
2 dog    Bruno          Labrador                        11
3 cat    Charlie        Ragdolls                        15
4 mouse  Julius Cheeser Agouti                           1
5 cat    Felix          Persian                         15
6 horse  Tina           Mustang                         NA
7 rabbit snowball       Persian                         12
right_join(pets, breeds, by = c("animal" = "species", "breed"))
# A tibble: 7 × 4
  animal name           breed              life_expectancy
  <chr>  <chr>          <chr>                        <dbl>
1 cat    Leo            European Shorthair              14
2 dog    Bruno          Labrador                        11
3 cat    Charlie        Ragdolls                        15
4 mouse  Julius Cheeser Agouti                           1
5 cat    Felix          Persian                         15
6 rabbit snowball       Persian                         12
7 snake  <NA>           Python                          25

Finding keys

full_join(pets, breeds, by = c("animal" = "species", "breed"))
# A tibble: 8 × 4
  animal name           breed              life_expectancy
  <chr>  <chr>          <chr>                        <dbl>
1 cat    Leo            European Shorthair              14
2 dog    Bruno          Labrador                        11
3 cat    Charlie        Ragdolls                        15
4 mouse  Julius Cheeser Agouti                           1
5 cat    Felix          Persian                         15
6 horse  Tina           Mustang                         NA
7 rabbit snowball       Persian                         12
8 snake  <NA>           Python                          25
inner_join(pets, breeds, by = c("animal" = "species", "breed"))
# A tibble: 6 × 4
  animal name           breed              life_expectancy
  <chr>  <chr>          <chr>                        <dbl>
1 cat    Leo            European Shorthair              14
2 dog    Bruno          Labrador                        11
3 cat    Charlie        Ragdolls                        15
4 mouse  Julius Cheeser Agouti                           1
5 cat    Felix          Persian                         15
6 rabbit snowball       Persian                         12

Finding keys

semi_join(pets, breeds, by = c("animal" = "species", "breed"))
# A tibble: 6 × 3
  animal name           breed             
  <chr>  <chr>          <chr>             
1 cat    Leo            European Shorthair
2 dog    Bruno          Labrador          
3 cat    Charlie        Ragdolls          
4 mouse  Julius Cheeser Agouti            
5 cat    Felix          Persian           
6 rabbit snowball       Persian           
anti_join(pets, breeds, by = c("animal" = "species", "breed"))
# A tibble: 1 × 3
  animal name  breed  
  <chr>  <chr> <chr>  
1 horse  Tina  Mustang
pets |> 
  filter(animal %in% breeds$species,
         breed  %in% breeds$breed)
# A tibble: 6 × 3
  animal name           breed             
  <chr>  <chr>          <chr>             
1 cat    Leo            European Shorthair
2 dog    Bruno          Labrador          
3 cat    Charlie        Ragdolls          
4 mouse  Julius Cheeser Agouti            
5 cat    Felix          Persian           
6 rabbit snowball       Persian           
pets |> 
  filter(!animal %in% breeds$species,
         !breed  %in% breeds$breed)
# A tibble: 1 × 3
  animal name  breed  
  <chr>  <chr> <chr>  
1 horse  Tina  Mustang

Many to many

  • When you get a many-to-many instead of a many-to-one join, it can indicate a problem
  • Either you have chosen the wrong key by accident or your data model is wrong
  • But the question is a conceptual one:
    • For example, several pets usually have the same species and breed, but one pet should have one breed (or maybe a mix)
    • But a single pet can have multiple owners, and each owner can have multiple pets
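Since dplyr 1.1.0, the join functions warn when they detect an unexpected many-to-many match; if such a join is intentional, you can declare it via the relationship argument. A minimal sketch with toy keys (the data here is made up for illustration):

```r
library(dplyr)
library(tibble)

x <- tibble(key = c(1, 1), x_val = c("a", "b"))
y <- tibble(key = c(1, 1), y_val = c("A", "B"))

# declaring the relationship documents the intent and silences the warning
res <- left_join(x, y, by = "key", relationship = "many-to-many")
res # 4 rows: each x row is matched with each y row sharing the same key
```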

many-to-one

many-to-many

Case study: 2016 US primaries

library(rio)
csv_folder_url <- "https://raw.githubusercontent.com/houstondatavis/data-jam-august-2016/master/csv"
results <- import(paste(csv_folder_url, "primary_results.csv", sep = "/"))
schedule <- import(paste(csv_folder_url, "primary_schedule.csv", sep = "/"))
facts <- import(paste(csv_folder_url, "county_facts.csv", sep = "/")) |>
  select(area_name,
    population = Pop_2014_count,
    pop_change = Pop_change_pct,
    over65 = Age_over_65_pct,
    female = Sex_female_pct,
    white = Race_white_pct,
    college = Pop_college_grad_pct,
    income = Income_per_capita
  )
glimpse(results, width = 50)
Rows: 24,611
Columns: 8
$ state              <chr> "Alabama", "Alabama",…
$ state_abbreviation <chr> "AL", "AL", "AL", "AL…
$ county             <chr> "Autauga", "Autauga",…
$ fips               <dbl> 1001, 1001, 1003, 100…
$ party              <chr> "Democrat", "Democrat…
$ candidate          <chr> "Bernie Sanders", "Hi…
$ votes              <int> 544, 2387, 2694, 5290…
$ fraction_votes     <dbl> 0.182, 0.800, 0.329, …
glimpse(schedule, width = 50)
Rows: 113
Columns: 4
$ date  <chr> "2/1/16", "2/9/16", "2/20/16", "2/…
$ state <chr> "Iowa", "New Hampshire", "Nevada",…
$ party <chr> "Democrat", "Democrat", "Democrat"…
$ type  <chr> "Semi-open caucus", "Semi-closed p…
glimpse(facts, width = 50)
Rows: 3,195
Columns: 8
$ area_name  <chr> "United States", "Alabama", "…
$ population <int> 318857056, 4849377, 55395, 20…
$ pop_change <dbl> 3.3, 1.4, 1.5, 9.8, -2.1, -1.…
$ over65     <dbl> 14.5, 15.3, 13.8, 18.7, 16.5,…
$ female     <dbl> 50.8, 51.5, 51.4, 51.2, 46.6,…
$ white      <dbl> 77.4, 69.7, 77.9, 87.1, 50.2,…
$ college    <dbl> 28.8, 22.6, 20.9, 27.7, 13.4,…
$ income     <int> 28155, 23680, 24571, 26766, 1…
  • RQ 1: How was each candidate doing at specific points in time?
  • RQ 2: Was there a connection between age and candidate preference in the Democratic Party?

2016 US primaries: data wrangling

Convert the date in the schedule to date class:

schedule <- schedule %>% 
  mutate(date = lubridate::mdy(date))
schedule
          date                state      party                type
1   2016-02-01                 Iowa   Democrat    Semi-open caucus
2   2016-02-09        New Hampshire   Democrat Semi-closed primary
3   2016-02-20               Nevada   Democrat       Closed caucus
4   2016-02-27       South Carolina   Democrat        Open primary
5   2016-03-01              Alabama   Democrat        Open primary
6   2016-03-01       American Samoa   Democrat       Closed caucus
7   2016-03-01             Arkansas   Democrat        Open primary
8   2016-03-01             Colorado   Democrat       Closed caucus
9   2016-03-01              Georgia   Democrat        Open primary
10  2016-03-01        Massachusetts   Democrat Semi-closed primary
11  2016-03-01            Minnesota   Democrat         Open caucus
12  2016-03-01             Oklahoma   Democrat Semi-closed primary
13  2016-03-01            Tennessee   Democrat        Open primary
14  2016-03-01                Texas   Democrat        Open primary
15  2016-03-01              Vermont   Democrat        Open primary
16  2016-03-01             Virginia   Democrat        Open primary
17  2016-03-05               Kansas   Democrat       Closed caucus
18  2016-03-05            Louisiana   Democrat      Closed primary
19  2016-03-05             Nebraska   Democrat       Closed caucus
20  2016-03-06                Maine   Democrat       Closed caucus
21  2016-03-08     Democrats Abroad   Democrat      Closed primary
22  2016-03-08             Michigan   Democrat        Open primary
23  2016-03-08          Mississippi   Democrat        Open primary
24  2016-03-12    Northern Marianas   Democrat       Closed caucus
25  2016-03-15              Florida   Democrat      Closed primary
26  2016-03-15             Illinois   Democrat        Open primary
27  2016-03-15             Missouri   Democrat        Open primary
28  2016-03-15       North Carolina   Democrat Semi-closed primary
29  2016-03-15                 Ohio   Democrat   Semi-open primary
30  2016-03-22              Arizona   Democrat      Closed primary
31  2016-03-22                Idaho   Democrat         Open caucus
32  2016-03-22                 Utah   Democrat    Semi-open caucus
33  2016-03-26               Alaska   Democrat       Closed caucus
34  2016-03-26               Hawaii   Democrat  Semi-closed caucus
35  2016-03-26           Washington   Democrat         Open caucus
36  2016-04-05            Wisconsin   Democrat        Open primary
37  2016-04-09              Wyoming   Democrat       Closed caucus
38  2016-04-19             New York   Democrat      Closed primary
39  2016-04-26          Connecticut   Democrat      Closed primary
40  2016-04-26             Delaware   Democrat      Closed primary
41  2016-04-26             Maryland   Democrat      Closed primary
42  2016-04-26         Pennsylvania   Democrat      Closed primary
43  2016-04-26         Rhode Island   Democrat Semi-closed primary
44  2016-05-03              Indiana   Democrat        Open primary
45  2016-05-07                 Guam   Democrat       Closed caucus
46  2016-05-10        West Virginia   Democrat Semi-closed primary
47  2016-05-17             Kentucky   Democrat      Closed primary
48  2016-05-17               Oregon   Democrat      Closed primary
49  2016-06-04       Virgin Islands   Democrat       Closed caucus
50  2016-06-05          Puerto Rico   Democrat        Open primary
51  2016-06-07           California   Democrat Semi-closed primary
52  2016-06-07              Montana   Democrat        Open primary
53  2016-06-07           New Jersey   Democrat Semi-closed primary
54  2016-06-07           New Mexico   Democrat      Closed primary
55  2016-06-07         North Dakota   Democrat         Open caucus
56  2016-06-07         South Dakota   Democrat Semi-closed primary
57  2016-06-14 District of Columbia   Democrat      Closed primary
58  2016-02-01                 Iowa Republican       Closed caucus
59  2016-02-09        New Hampshire Republican Semi-closed primary
60  2016-02-20       South Carolina Republican        Open primary
61  2016-02-23               Nevada Republican       Closed caucus
62  2016-03-01              Alabama Republican        Open primary
63  2016-03-01               Alaska Republican       Closed caucus
64  2016-03-01             Arkansas Republican        Open primary
65  2016-03-01             Colorado Republican       Closed caucus
66  2016-03-01              Georgia Republican        Open primary
67  2016-03-01        Massachusetts Republican Semi-closed primary
68  2016-03-01            Minnesota Republican         Open caucus
69  2016-03-01         North Dakota Republican       Closed caucus
70  2016-03-01             Oklahoma Republican      Closed primary
71  2016-03-01            Tennessee Republican        Open primary
72  2016-03-01                Texas Republican        Open primary
73  2016-03-01              Vermont Republican        Open primary
74  2016-03-01             Virginia Republican        Open primary
75  2016-03-01              Wyoming Republican       Closed caucus
76  2016-03-05               Kansas Republican       Closed caucus
77  2016-03-05             Kentucky Republican       Closed caucus
78  2016-03-05            Louisiana Republican      Closed primary
79  2016-03-05                Maine Republican       Closed caucus
80  2016-03-06          Puerto Rico Republican        Open primary
81  2016-03-08               Hawaii Republican       Closed caucus
82  2016-03-08                Idaho Republican      Closed primary
83  2016-03-08             Michigan Republican        Open primary
84  2016-03-08          Mississippi Republican        Open primary
85  2016-03-10       Virgin Islands Republican       Closed caucus
86  2016-03-12 District of Columbia Republican   Closed convention
87  2016-03-12                 Guam Republican       Closed caucus
88  2016-03-15              Florida Republican      Closed primary
89  2016-03-15             Illinois Republican        Open primary
90  2016-03-15             Missouri Republican        Open primary
91  2016-03-15    Northern Marianas Republican       Closed caucus
92  2016-03-15       North Carolina Republican Semi-closed primary
93  2016-03-15                 Ohio Republican Semi-closed primary
94  2016-03-22       American Samoa Republican         Open caucus
95  2016-03-22              Arizona Republican      Closed primary
96  2016-03-22                 Utah Republican  Semi-closed caucus
97  2016-04-05            Wisconsin Republican        Open primary
98  2016-04-19             New York Republican      Closed primary
99  2016-04-26          Connecticut Republican      Closed primary
100 2016-04-26             Delaware Republican      Closed primary
101 2016-04-26             Maryland Republican      Closed primary
102 2016-04-26         Pennsylvania Republican      Closed primary
103 2016-04-26         Rhode Island Republican Semi-closed primary
104 2016-05-03              Indiana Republican        Open primary
105 2016-05-10             Nebraska Republican Semi-closed primary
106 2016-05-10        West Virginia Republican Semi-closed primary
107 2016-05-17               Oregon Republican      Closed primary
108 2016-05-24           Washington Republican      Closed primary
109 2016-06-07           California Republican      Closed primary
110 2016-06-07              Montana Republican      Closed primary
111 2016-06-07           New Jersey Republican Semi-closed primary
112 2016-06-07           New Mexico Republican      Closed primary
113 2016-06-07         South Dakota Republican      Closed primary

Aggregate results at the state level instead of the county level:

results_state <- results %>% 
  group_by(state, party, candidate) %>% 
  summarize(votes = sum(votes))
results_state
# A tibble: 290 × 4
# Groups:   state, party [95]
   state   party      candidate        votes
   <chr>   <chr>      <chr>            <int>
 1 Alabama Democrat   Bernie Sanders   76399
 2 Alabama Democrat   Hillary Clinton 309928
 3 Alabama Republican Ben Carson       87517
 4 Alabama Republican Donald Trump    371735
 5 Alabama Republican John Kasich      37970
 6 Alabama Republican Marco Rubio     159802
 7 Alabama Republican Ted Cruz        180608
 8 Alaska  Democrat   Bernie Sanders     440
 9 Alaska  Democrat   Hillary Clinton     99
10 Alaska  Republican Ben Carson        2401
# ℹ 280 more rows

2016 US primaries: candidates performance over time

results_state_time <- results_state |> 
  inner_join(schedule, by = c("state", "party"))
results_state_time
# A tibble: 290 × 6
# Groups:   state, party [95]
   state   party      candidate        votes date       type         
   <chr>   <chr>      <chr>            <int> <date>     <chr>        
 1 Alabama Democrat   Bernie Sanders   76399 2016-03-01 Open primary 
 2 Alabama Democrat   Hillary Clinton 309928 2016-03-01 Open primary 
 3 Alabama Republican Ben Carson       87517 2016-03-01 Open primary 
 4 Alabama Republican Donald Trump    371735 2016-03-01 Open primary 
 5 Alabama Republican John Kasich      37970 2016-03-01 Open primary 
 6 Alabama Republican Marco Rubio     159802 2016-03-01 Open primary 
 7 Alabama Republican Ted Cruz        180608 2016-03-01 Open primary 
 8 Alaska  Democrat   Bernie Sanders     440 2016-03-26 Closed caucus
 9 Alaska  Democrat   Hillary Clinton     99 2016-03-26 Closed caucus
10 Alaska  Republican Ben Carson        2401 2016-03-01 Closed caucus
# ℹ 280 more rows
results_state_time |> 
  arrange(date) |> 
  group_by(candidate) |> 
  mutate(votes_agg = cumsum(votes)) |> 
  ungroup() |> 
  ggplot(mapping = aes(x = date, y = votes_agg, colour = candidate)) +
  geom_line()

results_state_time |> 
  arrange(date) |> 
  group_by(candidate) |> 
  mutate(votes_agg = cumsum(votes)) |> 
  ungroup() |> 
  ggplot(mapping = aes(x = date, y = votes_agg, colour = candidate)) +
  geom_line() +
  facet_wrap("party")

2016 US primaries: candidates performance vs average age

results_state_facts <- results_state |> 
  inner_join(facts, by = c("state" = "area_name")) |> 
  group_by(state, party) |> 
  mutate(total_votes = sum(votes), 
         pct_votes = votes / total_votes) |> 
  ungroup()
results_state_facts
# A tibble: 290 × 13
   state   party      candidate  votes population pop_change over65 female white
   <chr>   <chr>      <chr>      <int>      <int>      <dbl>  <dbl>  <dbl> <dbl>
 1 Alabama Democrat   Bernie S…  76399    4849377        1.4   15.3   51.5  69.7
 2 Alabama Democrat   Hillary … 309928    4849377        1.4   15.3   51.5  69.7
 3 Alabama Republican Ben Cars…  87517    4849377        1.4   15.3   51.5  69.7
 4 Alabama Republican Donald T… 371735    4849377        1.4   15.3   51.5  69.7
 5 Alabama Republican John Kas…  37970    4849377        1.4   15.3   51.5  69.7
 6 Alabama Republican Marco Ru… 159802    4849377        1.4   15.3   51.5  69.7
 7 Alabama Republican Ted Cruz  180608    4849377        1.4   15.3   51.5  69.7
 8 Alaska  Democrat   Bernie S…    440     736732        3.7    9.4   47.4  66.9
 9 Alaska  Democrat   Hillary …     99     736732        3.7    9.4   47.4  66.9
10 Alaska  Republican Ben Cars…   2401     736732        3.7    9.4   47.4  66.9
# ℹ 280 more rows
# ℹ 4 more variables: college <dbl>, income <int>, total_votes <int>,
#   pct_votes <dbl>
results_state_facts |>
  filter(party == "Democrat",
    candidate %in% c("Bernie Sanders",
                     "Hillary Clinton")
  ) |>
  ggplot(aes(x = over65, y = pct_votes, colour = candidate)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

Exercises 1

  1. See if some of the other variables influence support for Sanders or Clinton
  2. Below I prepared code that loads data from the nycflights13 package, which covers flights departing NYC in 2013:
library(nycflights13)
data(airports)
data(flights)
data(weather)
data(planes)
data(airlines)

The dataset is perfect for practising joining data, as the set is split into tables that relate to each other. You can see the relation in the diagram below:

Imagine you wanted to draw (approximately) the route each plane flies from its origin to its destination. What variables would you need? What data frames would you need to combine?

  1. Add the location of the origin and destination (i.e. the lat and lon) to flights.
  2. Which airlines departed most often from the NYC airports (i.e., JFK, LGA or EWR)? Use their real names, not the abbreviations
  3. How could you calculate the average age of the planes departing from NYC? (hint: you can calculate averages with mean(), you might want to use na.rm = TRUE)

Intro to Structured query language (SQL)

Why learn SQL?

  • managing tables in files instead of a database is error-prone:
    • you need to keep files consistent
    • you need to maintain relational integrity
  • using files instead of a database is slow:
    • you have to read in files before doing operations on the data (although systems like arrow improve the situation)
    • you need to use your local hardware instead of a remote server (although you could SSH into a server)
    • collaboration is limited and again error-prone
  • database management systems (DBMS), which solve these issues, often speak Structured Query Language (SQL)

Interacting with a database management system from Weidmann (2023), p. 105.

SQL basic syntax

  • R has a generic interface to communicate with relational databases: R Database Interface or DBI
library(DBI)

To communicate with a specific database format, DBI needs a driver

library(RSQLite)
  • RSQLite: translates R commands to SQLite
  • SQLite: a free and open source, well-known embedded database engine

Creating and connecting to an SQLite database

dir.create("data", showWarnings = FALSE)
unlink("data/db.sql") # just to make sure we use a new db
db <- dbConnect(SQLite(), "data/db.sql")
db
<SQLiteConnection>
  Path: data/db.sql
  Extensions: TRUE
class(db)
[1] "SQLiteConnection"
attr(,"package")
[1] "RSQLite"
  • dbConnect(): sets up a connection
  • SQLite(): provides the driver for the specific DBMS

Let’s see what’s in the database so far:

dbGetQuery(db, "SELECT name FROM sqlite_master")
[1] name
<0 rows> (or 0-length row.names)

Nothing. SQLite databases are typically stored in a single file:

file.size("data/db.sql")
[1] 0

Our file has size 0, which means it is just an empty placeholder so far.

Adding a table to the SQLite database

So far, the database is empty:

# this is a shorthand for "SELECT name FROM sqlite_master" above
dbListTables(db) 
character(0)

Before we can add data, we have to create a table in the database:

dbExecute(db,
          "CREATE TABLE df1 (
            election_id integer,
            capital_letters varchar)")
[1] 0
dbListTables(db)
[1] "df1"

Besides integer and varchar (suitable for short strings), you can also use real (numeric values), date, and blob (binary data).
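As a quick illustration of these column types (the table and column names here are invented for the sketch, and a throwaway in-memory database is used so the db connection from above stays untouched):

```r
library(DBI)
library(RSQLite)

# throwaway in-memory database for illustration
con <- dbConnect(SQLite(), ":memory:")

# one column per type discussed above
dbExecute(con,
          "CREATE TABLE measurements (
             id integer,
             label varchar,
             value real,
             taken_on date,
             raw blob)")

fields <- dbListFields(con, "measurements")
fields
dbDisconnect(con)
```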

Now we can add cases:

dbExecute(db,
          "INSERT INTO df1
            VALUES (1, 'A')")
[1] 1

And we can pull the data back out:

dbGetQuery(db, "SELECT * FROM df1")
  election_id capital_letters
1           1               A

We can start from scratch by deleting this table:

dbExecute(db, "DROP TABLE df1")
[1] 0

Adding a table with constraints

We dropped the last table, so the database is empty again

dbListTables(db)
character(0)

Let’s define the table again, but add constraints to enforce certain data:

dbExecute(db,
          "CREATE TABLE df1 (
              id INTEGER PRIMARY KEY,
              capital_letters VARCHAR(1) CHECK (capital_letters GLOB '[A-Z]'), 
              my_date DATE CHECK (my_date > '2000-01-01')
          )")
[1] 0
dbListTables(db)
[1] "df1"
  • id is now a primary key, which makes sure that it is unique
  • the field capital_letters is now constrained to capital letters
  • only dates after 2000-01-01 are allowed

Adding a table with constraints

Let’s try to add a case:

dbExecute(db,
          "INSERT INTO df1
            VALUES (1, 'a', '1999-12-31')")
Error: CHECK constraint failed: capital_letters GLOB '[A-Z]'

Let’s try again with a capital letter:

dbExecute(db,
          "INSERT INTO df1
            VALUES (1, 'A', '1999-12-31')")
Error: CHECK constraint failed: my_date > '2000-01-01'

OK, let’s adhere to the constraints:

dbExecute(db,
          "INSERT INTO df1
            VALUES (1, 'A', '2000-12-31')")
[1] 1

And let’s test the last constraint:

dbExecute(db,
          "INSERT INTO df1
            VALUES (1, 'A', '2000-12-31')")
Error: UNIQUE constraint failed: df1.id

Adding Data

We already know how to add a complete case:

dbExecute(db,
          "INSERT INTO df1
            VALUES (2, 'B', '2000-12-31')")
[1] 1

But you can also add a case with a missing value:

dbExecute(db,
          "INSERT INTO df1 (id, capital_letters)
            VALUES (3, 'C')")
[1] 1

OR:

dbExecute(db,
          "INSERT INTO df1 
            VALUES (4, 'D', NULL)")
[1] 1
dbGetQuery(db, "SELECT * FROM df1")
  id capital_letters    my_date
1  1               A 2000-12-31
2  2               B 2000-12-31
3  3               C       <NA>
4  4               D       <NA>

Updating Data

If we made a mistake or something changes, we can update a specific row:

dbExecute(db, "UPDATE df1
               SET my_date = '2023-07-22'
               WHERE id = 3")
[1] 1

Or we can add a new variable and calculate it from existing values:

dbExecute(db, "ALTER TABLE df1 ADD year integer")
[1] 0
dbExecute(db, "UPDATE df1
               SET year = strftime('%Y', my_date)")
[1] 4
dbGetQuery(db, "SELECT * FROM df1")
  id capital_letters    my_date year
1  1               A 2000-12-31 2000
2  2               B 2000-12-31 2000
3  3               C 2023-07-22 2023
4  4               D       <NA>   NA

Accessing Data

We have already seen the SELECT command, let’s explore what else we can do!

Access only certain fields:

dbGetQuery(db, "SELECT id, year FROM df1")
  id year
1  1 2000
2  2 2000
3  3 2023
4  4   NA

Access only certain rows:

dbGetQuery(db, "SELECT id, year 
                FROM df1
                WHERE year > 2000")
  id year
1  3 2023

Compute new values on the fly:

dbGetQuery(db, "SELECT id, my_date, strftime('%d, %m', my_date) FROM df1")
  id    my_date strftime('%d, %m', my_date)
1  1 2000-12-31                      31, 12
2  2 2000-12-31                      31, 12
3  3 2023-07-22                      22, 07
4  4       <NA>                        <NA>

Deleting Data

Two ways to get rid of data:

  • DELETE: used to remove rows from a table
  • DROP: used to remove entire database objects like tables, databases, indexes, or views
dbExecute(db, "DELETE FROM df1 
               WHERE id = 1")
[1] 1
dbGetQuery(db, "SELECT * FROM df1")
  id capital_letters    my_date year
1  2               B 2000-12-31 2000
2  3               C 2023-07-22 2023
3  4               D       <NA>   NA
dbExecute(db, "DROP TABLE df1")
[1] 0
dbGetQuery(db, "SELECT * FROM df1")
Error: no such table: df1

SQL basic syntax: summary

  • SQL is insensitive to line breaks
  • keywords and identifiers are case-insensitive
  • by convention, SQL keywords are written in upper case, and names of tables, columns, and functions in lower case
  • important verbs are:
    • CREATE: define a new table
    • INSERT: add new rows to a table
    • UPDATE: modify existing rows in a table
    • ALTER: modify existing fields in a table
    • SELECT: query data from one or more tables
    • DELETE: remove rows from a table
    • DROP: remove a database object (e.g., table)
# line breaks inside a query are ignored:
dbGetQuery(db, "SELECT * 
                FROM df1
                WHERE id < 4")
# keywords and names are case-insensitive (though hard to read this way):
dbGetQuery(db, "sELeCt id FROM Df1")

SQL basic syntax (from R)

The DBI package and the driver packages provide a couple of useful helper functions. We have already seen dbListTables():

dbListTables(db)
character(0)

You can also create a table from a data.frame:

dbWriteTable(conn = db, name = "df1", value = df1)
dbWriteTable(conn = db, name = "df2", value = df2)

And get information about the defined fields (a field is SQL’s term for a column):

dbListFields(conn = db, name = "df1")
[1] "id"              "capital_letters"
dbListFields(conn = db, name = "df2")
[1] "id"                "lowercase_letters"

SQL basic syntax (from R)

We can read the entire table back into R:

dbReadTable(conn = db, "df1")
  id capital_letters
1  1               A
2  2               B
3  3               C
4  4               D
5  5               E
6  6               F

Delete (drop) a table:

dbRemoveTable(conn = db, "df1")

Check if a table exists:

if (!dbExistsTable(conn = db, "df1")) {
  message("df1 does not exist. Creating it")
  dbWriteTable(conn = db, name = "df1", value = df1)
}

Exercises 2

  1. Create a new table in db called elections with the fields: election_id, country_name, election_date, party_id, vote_share, seats, and seats_total. Choose appropriate data types for the fields (maybe peek ahead to the next exercise).
  2. Using SQL syntax, add this data to the elections table:

     election_id  country_name  election_date  party_id  vote_share  seats  seats_total
     1030         Austria       1919-02-16     97        40.75       72     170

  3. Using SQL syntax, add this data to the elections table:

     election_id  country_name  election_date  vote_share  party_id
     872          Belgium       1908-05-24     22.6        2422

  4. Querying data from elections, create an R object that has the election_id, party_id, and vote_share fields for elections after 1910.
  5. Querying data from elections, get the election_id, party_id, and vote_share fields and just the year of the election.
  6. Add a new field vote_share_prop to the elections table which shows the vote share as a proportion rather than a percentage.

Joining tables in SQL

Some good news

We basically already know how joining in SQL works!

SQL vs R: left_join

R

left_join(x = df1, y = df2, by = c("id" = "id"))
# A tibble: 6 × 3
     id capital_letters lowercase_letters
  <dbl> <chr>           <chr>            
1     1 A               a                
2     2 B               b                
3     3 C               c                
4     4 D               d                
5     5 E               e                
6     6 F               <NA>             

SQL

dbGetQuery(db, "SELECT * FROM df1 
                LEFT JOIN df2 
                ON df1.id = df2.id")
  id capital_letters id lowercase_letters
1  1               A  1                 a
2  2               B  2                 b
3  3               C  3                 c
4  4               D  4                 d
5  5               E  5                 e
6  6               F NA              <NA>

right_join

R

right_join(x = df1, y = df2, by = c("id" = "id"))
# A tibble: 6 × 3
     id capital_letters lowercase_letters
  <dbl> <chr>           <chr>            
1     1 A               a                
2     2 B               b                
3     3 C               c                
4     4 D               d                
5     5 E               e                
6     7 <NA>            g                

SQL

dbGetQuery(db, "SELECT * FROM df1
                RIGHT JOIN df2 
                ON df1.id = df2.id")
  id capital_letters id lowercase_letters
1  1               A  1                 a
2  2               B  2                 b
3  3               C  3                 c
4  4               D  4                 d
5  5               E  5                 e
6 NA            <NA>  7                 g

full_join

R

full_join(x = df1, y = df2, by = c("id" = "id"))
# A tibble: 7 × 3
     id capital_letters lowercase_letters
  <dbl> <chr>           <chr>            
1     1 A               a                
2     2 B               b                
3     3 C               c                
4     4 D               d                
5     5 E               e                
6     6 F               <NA>             
7     7 <NA>            g                

SQL

dbGetQuery(db, "SELECT * FROM df1
                FULL JOIN df2
                ON df1.id = df2.id")
  id capital_letters id lowercase_letters
1  1               A  1                 a
2  2               B  2                 b
3  3               C  3                 c
4  4               D  4                 d
5  5               E  5                 e
6  6               F NA              <NA>
7 NA            <NA>  7                 g

inner_join

R

inner_join(x = df1, y = df2, by = c("id" = "id"))
# A tibble: 5 × 3
     id capital_letters lowercase_letters
  <dbl> <chr>           <chr>            
1     1 A               a                
2     2 B               b                
3     3 C               c                
4     4 D               d                
5     5 E               e                

SQL

dbGetQuery(db, "SELECT * FROM df1 
                INNER JOIN df2 
                ON df1.id = df2.id")
  id capital_letters id lowercase_letters
1  1               A  1                 a
2  2               B  2                 b
3  3               C  3                 c
4  4               D  4                 d
5  5               E  5                 e

semi_join

R

semi_join(x = df1, y = df2, by = c("id" = "id"))
# A tibble: 5 × 2
     id capital_letters
  <int> <chr>          
1     1 A              
2     2 B              
3     3 C              
4     4 D              
5     5 E              
df1 |> 
  filter(id %in% df2$id)
# A tibble: 5 × 2
     id capital_letters
  <int> <chr>          
1     1 A              
2     2 B              
3     3 C              
4     4 D              
5     5 E              

SQL

dbGetQuery(db, "SELECT * FROM df1 
                WHERE EXISTS (
                  SELECT 1 FROM df2 
                  WHERE df1.id = df2.id
                )")
  id capital_letters
1  1               A
2  2               B
3  3               C
4  4               D
5  5               E

anti_join

R

anti_join(x = df1, y = df2, by = c("id" = "id"))
# A tibble: 1 × 2
     id capital_letters
  <int> <chr>          
1     6 F              
df1 |> 
  filter(!id %in% df2$id)
# A tibble: 1 × 2
     id capital_letters
  <int> <chr>          
1     6 F              

SQL

dbGetQuery(db, "SELECT * FROM df1 
                WHERE NOT EXISTS (
                  SELECT 1 FROM df2
                  WHERE df1.id = df2.id
                )")
  id capital_letters
1  6               F

Exercises 3

  1. Add the three data.frames from the 2016 primaries case study to the database (schedule, results_state, and facts)
  2. Right join results_state and facts
  3. Recreate the table results_state_time by querying and joining from db. Hint: multiple keys can be used by putting AND between them, e.g., ON df1.id = df2.id AND df1.id2 = df2.id2.
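
To make the hint concrete, here is a minimal, self-contained sketch with two made-up tables t1 and t2 that share the key columns id and id2:

```r
library(DBI)
library(RSQLite)

# Sketch only: t1 and t2 are made-up tables sharing the keys id and id2
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con, "t1", data.frame(id = c(1, 1), id2 = c(1, 2), a = c("x", "y")))
dbWriteTable(con, "t2", data.frame(id = c(1, 1), id2 = c(1, 2), b = c("p", "q")))

# Multiple join keys are combined with AND
dbGetQuery(con, "SELECT * FROM t1
                 LEFT JOIN t2
                 ON t1.id = t2.id AND t1.id2 = t2.id2")
dbDisconnect(con)
```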

dbplyr

Working with connections in R

If you think SQL is not necessarily your thing, there is good news: dbplyr makes it possible to connect to database tables and work with them using dplyr syntax. You do not need to attach dbplyr for this; just load dplyr and have dbplyr installed.

library(dbplyr) # you don't usually need to load this, just have it installed

Let’s create a new table for that:

dbWriteTable(db, "diamonds", diamonds)

To connect to a table in a database, we can use tbl:

diamonds_con <- tbl(db, "diamonds")
diamonds_con
# Source:   table<`diamonds`> [?? x 10]
# Database: sqlite 3.45.2 [data/db.sql]
   carat cut       color clarity depth table price     x     y     z
   <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ more rows

Working with connections in R

However, this table is just a connection; R has pulled only the first couple of rows to display them.

object.size(diamonds)
3456848 bytes
object.size(diamonds_con)
5960 bytes

This is also why View doesn’t really work well:

View(diamonds_con)
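
If you just want to peek at the data, one option (a sketch, not the only way) is to collect only the first rows; head() on a lazy table is translated into a SQL LIMIT clause:

```r
# Pull only the first 20 rows into R before viewing them
diamonds_con |> 
  head(20) |> 
  collect() |> 
  View()
```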

collect

To get the object into R properly, you need to collect() the data first:

x_df <- collect(diamonds_con)

x_df is now a real data.frame

object.size(x_df)
4102608 bytes
View(x_df)

Supports dplyr verbs

Filter, select, mutate and summarise:

diamonds_con |> 
  filter(cut == "Premium", 
         price < 400) |> 
  select(price) |> 
  mutate(price_gbp = price * 0.77) |> 
  summarise(avg_price = mean(price),
            avg_price_gbp = mean(price_gbp))
# Source:   SQL [1 x 2]
# Database: sqlite 3.45.2 [data/db.sql]
  avg_price avg_price_gbp
      <dbl>         <dbl>
1      376.          289.

Joins also work:

df1_con <- tbl(db, "df1")
df2_con <- tbl(db, "df2")
df1_con |> 
  left_join(df2_con, by = "id") |> 
  collect()
# A tibble: 6 × 3
     id capital_letters lowercase_letters
  <int> <chr>           <chr>            
1     1 A               a                
2     2 B               b                
3     3 C               c                
4     4 D               d                
5     5 E               e                
6     6 F               <NA>             

Supports dplyr verbs

If you decide to implement this query somewhere else, you can display the SQL that dbplyr generates with show_query():

df1_con |> 
  left_join(df2_con, by = "id") |> 
  show_query()
<SQL>
SELECT `df1`.*, `lowercase_letters`
FROM `df1`
LEFT JOIN `df2`
  ON (`df1`.`id` = `df2`.`id`)

collect() vs. compute()

  • collect() gets data from the database; like dbGetQuery()
  • compute() changes data in the database; like dbExecute()

We join the two tables again, and also compute a new column:

df1_con |> 
  left_join(df2_con, by = "id") |> 
  mutate(all_letters = paste(capital_letters, 
                             lowercase_letters)) |> 
  collect()
# A tibble: 6 × 4
     id capital_letters lowercase_letters all_letters
  <int> <chr>           <chr>             <chr>      
1     1 A               a                 A a        
2     2 B               b                 B b        
3     3 C               c                 C c        
4     4 D               d                 D d        
5     5 E               e                 E e        
6     6 F               <NA>              <NA>       

Instead of just bringing this new table into R, we can save it directly to the database:

df1_con |> 
  left_join(df2_con, by = "id") |> 
  mutate(all_letters = paste(capital_letters, 
                             lowercase_letters)) |> 
  compute() # only this line is different!
# Source:   table<`dbplyr_LFFcOIpPK6`> [6 x 4]
# Database: sqlite 3.45.2 [data/db.sql]
     id capital_letters lowercase_letters all_letters
  <int> <chr>           <chr>             <chr>      
1     1 A               a                 A a        
2     2 B               b                 B b        
3     3 C               c                 C c        
4     4 D               d                 D d        
5     5 E               e                 E e        
6     6 F               <NA>              <NA>       

collect() vs. compute()

However, if we look at the original table, it still looks the same:

df1_con |> 
  collect()
# A tibble: 6 × 2
     id capital_letters
  <int> <chr>          
1     1 A              
2     2 B              
3     3 C              
4     4 D              
5     5 E              
6     6 F              

So what gives?

collect() vs. compute()

When we list the tables, we see what happened:

dbListTables(db)
[1] "dbplyr_LFFcOIpPK6" "df1"               "df2"              
[4] "diamonds"          "sqlite_stat1"      "sqlite_stat4"     

By default, dbplyr does not overwrite the original table, but instead saves the results to a new, temporary table (temporary tables are removed when you disconnect from the database).

If we want to create a persistent version of the table, we can do:

df1_con |> 
  left_join(df2_con, by = "id") |> 
  mutate(all_letters = paste(capital_letters, 
                             lowercase_letters)) |> 
  compute(temporary = FALSE, name = "df1_updated")
# Source:   table<`df1_updated`> [6 x 4]
# Database: sqlite 3.45.2 [data/db.sql]
     id capital_letters lowercase_letters all_letters
  <int> <chr>           <chr>             <chr>      
1     1 A               a                 A a        
2     2 B               b                 B b        
3     3 C               c                 C c        
4     4 D               d                 D d        
5     5 E               e                 E e        
6     6 F               <NA>              <NA>       
tbl(db, "df1_updated")
# Source:   table<`df1_updated`> [6 x 4]
# Database: sqlite 3.45.2 [data/db.sql]
     id capital_letters lowercase_letters all_letters
  <int> <chr>           <chr>             <chr>      
1     1 A               a                 A a        
2     2 B               b                 B b        
3     3 C               c                 C c        
4     4 D               d                 D d        
5     5 E               e                 E e        
6     6 F               <NA>              <NA>       

collect() vs. compute()

And finally, if we want to update the table in place, we can bring it into R and then copy it back to overwrite the old one:

temp <- df1_con |> 
  left_join(df2_con, by = "id") |> 
  mutate(all_letters = paste(capital_letters, 
                             lowercase_letters)) |> 
  collect()
copy_to(db, temp, "df1", overwrite = TRUE)

Now we have replaced the original version:

df1_con |> 
  collect()
# A tibble: 6 × 4
     id capital_letters lowercase_letters all_letters
  <int> <chr>           <chr>             <chr>      
1     1 A               a                 A a        
2     2 B               b                 B b        
3     3 C               c                 C c        
4     4 D               d                 D d        
5     5 E               e                 E e        
6     6 F               <NA>              <NA>       

Case study: New York City flight data

Let’s move the nycflights13 data into a new database:

library(nycflights13)
unlink("data/db_nyc.sql") # just to make sure we use a new db
db_nyc <- dbConnect(SQLite(), "data/db_nyc.sql")
dbWriteTable(db_nyc, "airports", airports)
dbWriteTable(db_nyc, "flights", flights)
dbWriteTable(db_nyc, "weather", weather)
dbWriteTable(db_nyc, "planes", planes)
dbWriteTable(db_nyc, "airlines", airlines)

RQ1: Is there a relationship between the age of a plane and its delays?
RQ2: What weather conditions make it more likely to see a delay?

NYC flights: age vs delays

plane_delays_age <- tbl(db_nyc, "flights") |> 
  inner_join(tbl(db_nyc, "planes"), by = "tailnum", suffix = c("", "_plane")) |> 
  mutate(plane_age = year - year_plane) |> 
  select(arr_delay, plane_age) |> 
  filter(!is.na(arr_delay),
         !is.na(plane_age)) |> 
  collect()
plane_delays_age |> 
  group_by(plane_age) |> 
  summarise(avg_delay = mean(arr_delay)) |> 
  ggplot(aes(x = plane_age, y = avg_delay)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

NYC flights: weather conditions vs delays

plane_delays_weather <- tbl(db_nyc, "flights") |> 
  inner_join(tbl(db_nyc, "weather"), by = c("time_hour", "origin")) |> 
  filter(!is.na(temp), !is.na(arr_delay), !is.na(wind_speed)) |> 
  collect()
plane_delays_weather |>
  ggplot(aes(x = temp, y = arr_delay)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

plane_delays_weather |>
  ggplot(aes(x = wind_speed, y = arr_delay)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

Should you learn SQL properly?

It depends!

  • how often will you use it, and how advanced are your needs?
  • do you want to stay in the R ecosystem or extend your knowledge beyond it?
  • how many clients will use your database?

There is an interesting video discussing the question in more detail here: https://youtu.be/gdzONbwfWk0

Exercises 4

  1. Right join results_state and facts using dbplyr instead of dbGetQuery
  2. Recreate the table results_state_time by querying and joining from db (using dbplyr instead of dbGetQuery).
  3. Recreate results_state_facts using dbplyr. Don’t forget to add total_votes and pct_votes
  4. Extract the SQL query from your code in 1. and run it with dbGetQuery

Homework

Tomorrow, we will use the software Docker to simulate a database server. Install the software and make sure it works. You can follow the guide I made here:

https://youtu.be/iMyCdd5nP5U

When done, dbDisconnect

Whenever you are done working with a database, you should disconnect from it:

dbDisconnect(db)
  • This closes the connection, discards all pending work, and frees resources
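
In functions that open their own connection, a common defensive pattern (a sketch; query_db is a made-up name) is to register the disconnect with on.exit(), so the connection is closed even if the query errors:

```r
library(DBI)
library(RSQLite)

# Hypothetical helper: opens a connection and guarantees it is closed again
query_db <- function(path, sql) {
  con <- dbConnect(SQLite(), path)
  on.exit(dbDisconnect(con))  # runs on normal exit and on error
  dbGetQuery(con, sql)
}
```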

Wrap Up

Save some information about the session for reproducibility.

Show Session Info
sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] dbplyr_2.5.0       RSQLite_2.3.6      DBI_1.2.2          nycflights13_1.0.2
 [5] rio_1.0.1          lubridate_1.9.3    forcats_1.0.0      stringr_1.5.1     
 [9] dplyr_1.1.4        purrr_1.0.2        readr_2.1.5        tidyr_1.3.1       
[13] tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0    tinytable_0.3.0.10

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    lattice_0.22-6    stringi_1.8.4    
 [5] hms_1.1.3         digest_0.6.35     magrittr_2.0.3    evaluate_0.23    
 [9] grid_4.4.1        timechange_0.3.0  blob_1.2.4        fastmap_1.1.1    
[13] Matrix_1.7-0      R.oo_1.26.0       jsonlite_1.8.8    R.utils_2.12.3   
[17] mgcv_1.9-1        fansi_1.0.6       scales_1.3.0      cli_3.6.3        
[21] rlang_1.1.4       R.methodsS3_1.8.2 splines_4.4.1     bit64_4.0.5      
[25] munsell_0.5.1     cachem_1.0.8      withr_3.0.0       yaml_2.3.8       
[29] tools_4.4.1       tzdb_0.4.0        memoise_2.0.1     colorspace_2.1-0 
[33] curl_5.2.1        vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
[37] bit_4.0.5         pkgconfig_2.0.3   pillar_1.9.0      gtable_0.3.5     
[41] glue_1.7.0        data.table_1.15.4 xfun_0.44         tidyselect_1.2.1 
[45] rstudioapi_0.16.0 knitr_1.46        farver_2.1.2      nlme_3.1-164     
[49] htmltools_0.5.8.1 rmarkdown_2.26    labeling_0.4.3    compiler_4.4.1   
